Defining the characteristics of interferon-alpha–stimulated human genes: insight from expression data and machine learning

Abstract Background A virus-infected cell triggers a signalling cascade, resulting in the secretion of interferons (IFNs), which in turn induces the upregulation of the IFN-stimulated genes (ISGs) that play a role in antipathogen host defence. Here, we conducted analyses on large-scale data relating to evolutionary gene expression, sequence composition, and network properties to elucidate factors associated with the stimulation of human genes in response to IFN-α. Results We find that ISGs are less evolutionary conserved than genes that are not significantly stimulated in IFN experiments (non-ISGs). ISGs show obvious depletion of GC content in the coding region. This influences the representation of some compositions following the translation process. IFN-repressed human genes (IRGs), downregulated genes in IFN experiments, can have similar properties to the ISGs. Additionally, we design a machine learning framework integrating the support vector machine and novel feature selection algorithm that achieves an area under the receiver operating characteristic curve (AUC) of 0.7455 for ISG prediction. Its application in other IFN systems suggests the similarity between the ISGs triggered by type I and III IFNs. Conclusions ISGs have some unique properties that make them different from the non-ISGs. The representation of some properties has a strong correlation with gene expression following IFN-α stimulation, which can be used as a predictive feature in machine learning. Our model predicts several genes as putative ISGs that so far have shown no significant differential expression when stimulated with IFN-α in the cell/tissue types in the available databases. A web server implementing our method is accessible at http://isgpre.cvr.gla.ac.uk/. The docker image at https://hub.docker.com/r/hchai01/isgpre can be downloaded to reproduce the prediction.


Response to Reviewers:
We believe these revisions will address all the concerns raised by you and the reviewers. We apologies for the misunderstanding with respect to the ROC curve and AUC values. We have added these in the legend for each dataset. We have also added ISGPRE to biotools and registered ISGPRE with scicrunch.com and received the RRID:SCR_022730, however it does not yet appear when browsing the scicrunch dashboard. We have also made a few additional clarifications to the text, in particular, after much discussion, we decided to change the 'noisy' feature terminology to poorly performing feature (this was originally called disruptive feature). Here, the ISGs and non-ISGs are taken from dataset S2 (No. = 620 and 874) while the 266 background human genes are from dataset S1 (No. = 10836) ( Table 5). 267 268 To find conserved sequence patterns relating to gene regulation [42], we checked the existence 269 of 2940, 44100 and 661500 short linear nucleotide patterns (SLNPs) consisting of three to five 270 consecutive nucleobases in the group of the ISGs and non-ISGs. By using a positive 5% 271 12 difference in the occurrence frequency as the cut-off threshold, we found 7884 SLNPs with a 272 maximum difference in representation of around 15%. After using Pearson's chi-squared tests 273 and Benjamini-Hochberg correction to avoid type I error in multiple hypotheses [43], 7025 274 SLNPs remained with an adjusted p-value lower than 0.01 (Supplementary Data S4), hereon 275 referred to as "flagged" SLNPs. The differentially represented 7025 SLNPs were ranked 276 according to the adjusted p-value. As shown in Figure 6A (Pearson's chi-squared test: p > 0.05). GC-related dinucleotides, e.g., 'CpC','GpC' and 'GpG' 282 are rarely observed in the flagged SLNPs especially in the top 10 or top 100. In view of this, 283 we hypothesize that the differential representation of nucleotide compositions influences and 284 reflects on the pattern of SLNPs in the ISGs. By checking the co-occurrence status of the 285 flagged SLNPs, we found that these sequence patterns had a cumulative effect in distinguishing 286 the ISGs from non-ISGs, especially when the number of cooccurring SLNPs reached around 287 5320 (Pearson's chi-squared test: p = 7.9E-13, Figure 6B). There were eight (~1.3%) ISGs in 288 dataset S2 containing all the flagged 7025 SLNPs. Their up-regulations after IFN-α treatment 289 were generally low with a fold change fluctuating around 2.2. However, some of these eight 290 genes such as desmoplakin (DSP) were clearly highly up-regulated in endothelial cells isolated 291 from human umbilical cord veins after not only IFN-α treatments (fold change = 11.1) but also 292 IFN-β treatments (fold change = 13.7). We also found some non-ISGs (e.g., hemicentin 1 293 (HMCN1)) and human genes with limited expression in the IFN-α experiments (ELGs) (e.g., 294 tudor domain containing 6 (TDRD6)) containing the flagged SLNPs, but their frequencies were 295 lower than that in the ISGs. genes. Ranks in (A) are generated based on the adjusted p-value given by Pearson's chi-squared 299 tests after the Benjamini-Hochberg correction procedure. Detailed results of the hypothesis 300 tests are provided in Supplementary Data S4. Here, the ISGs and non-ISGs are taken from 301 dataset S2 while the background human genes are from dataset S1 ( Table 5). 302 303 Differences in the protein amino acid sequence 304 We used the amino acid sequences generated by the canonical transcript to extract features at 305 the proteomic level. In addition to the basic composition of 20 standard amino acids, we 306 considered 17 additional features related to physicochemical (e.g., hydropathy and polarity) or 307 geometric properties (e.g., volume) [44,45]. We found several amino acids that were either 308 enriched or depleted in the ISG products compared to the background human proteins, which 309 were produced by genes in dataset S1 (Figure 7). The differences were even more marked 310 between protein products of the ISGs and non-ISGs, highlighting some differences that were 311 not observed when comparing the ISG products to the background human proteins (e.g., 312 isoleucine composition). The differences observed in the amino acid compositions were at least 313 in part associated with the patterns previously observed in features encoded from genetic 314 coding regions. For example, asparagine (N) showed significant over-representation in the ISG 315 products compared to the non-ISG products or background human proteins (Mann-Whitney U 316 test: p = 2.8E-12 and 1.2E-03, respectively). This was expected as there are only two codons, 317 i.e., 'AAT' and 'AAC' coding for amino acid 'N', and dinucleotide 'ApA' showed a 318 remarkable enrichment in the coding region of ISGs. A similar explanation could be given for 319 the relationship between the deficiency of GpG content and amino acid 'G'. The translation of 320 amino acid 'K' was also influenced by ApA composition but was not significant due to the 321 14 mild representation of dinucleotide 'ApG' in the genetic coding region. Additionally, as 322 previously mentioned, the ISGs showed a significant depletion in the CpG content, and 323 consequently, the amino acid 'A' and 'R' in the ISG products were significantly under-324 represented. Cysteine (C) was not frequently observed in human proteins but still showed a 325 relatively significant enrichment in the ISG products (M1 = 2.3%, M2 = 2.5%, p = 1.8E-03). 326 When focusing on the composition of amino acid sequences grouped by 327 physicochemical or geometric properties, we found some features differentially represented 328 between the ISG products and background human proteins. The result showed that hydroxyl 329 (amino acid 'S' and 'T'), amide (amino acid 'N' and 'Q'), or sulfur amino acids (amino acid 330 'C' and 'M') were more abundant in the ISG products compared to the background human 331 proteins (Mann-Whitney U test: p = 0.04, 1.0E-03 and 0.02, respectively). Small amino acids 332 (amino acid 'N', 'C', 'T', aspartic acid (D) and proline (P), the volume ranging from 108.5 to 333 116.1 cubic angstroms) were more frequently observed in the ISG products than in background 334 human proteins (M1 = 22.1%, M2 = 21.7%, p = 0.02). These differences became more marked 335 when comparing the representation of these features between the ISG and non-ISG products. 336 For example, features relating to chemical properties of the side chain (e.g., aliphatic), charge 337 status and geometric volume showed differences between proteins produced by the ISGs and 338 non-ISGs. Some features such as neutral amino acids that include amino acid 'G', 'P', 'S', 'T', 339 histidine (H) and tyrosine (Y) were not differentially represented between the ISG and non-340 ISG products, but they indicated an obvious association with the change of IFN-α-triggered ISGs are taken from dataset S2 (No. = 620 and 874) while the background human genes are 347 from dataset S1 (No. = 10836) ( Table 5) Next, we searched the sequence of the ISG products against that of the non-ISG 363 products to find conserved short linear amino acid patterns (SLAAPs), which might be 364 constrained by strong purifying selection [47]. As opposed to the analysis of the genetic 365 sequence, we only obtained 19 enriched sequence patterns with a Pearson's chi-squared p-value 366 ranging from 1.5E-04 to 0.02 (Table 1), hereon referred to as flagged SLAAPs. They were 367 greatly influenced by four polar amino acids: 'K', 'N', 'E' and 'S', and one nonpolar amino 368 acid: 'L'. Some of these flagged SLAAPs, for example, SLAAP 'NVT' and 'S-N-E', were 369 clearly over-represented in the ISG products compared to the background human proteins and 370 could be used as features to differentiate the ISGs from background human genes. The third 371 16 column in Table 1 indicates a number of patterns that are lacking in the non-ISG products and 372 hence may be the reason for the lack of up-regulation in the presence of IFN-α. Particularly, 373 we noticed that SLAAP 'KEN' was a destruction motif that could be recognised or targeted by 374 anaphase promoting complex (APC) for polyubiquitination and proteasome-mediated 375 degradation [48,49]. Results shown in Figure 8A illustrate that the co-occurrence of 376 differentially represented SLAAPs (flagged) has a cumulative effect in distinguishing the ISGs 377 from non-ISGs. This cumulative effect can even be achieved with only two random SLAAPs 378 respectively. We hypothesize that enriched SLAAPs widely detected in the IDRs may be 389 important for human protein-protein interactions or potentially virus mimicry [53]. For instance, 390 in the ISG products, about 40.8% of SLAAP 'SxNxT' were observed in the IDRs, 14.9% higher 391 than that in non-ISG products ( Table 1). This difference reflected the importance of SLAAP 392 'SxNxT' for target specificity of IFN-α-induced protein-protein interactions (PPIs) [9] even if 393 it was not statistically significant. By contrast, the conditional frequency of SLAAP 'SxNxE' 394 in the IDRs of the ISG and non-ISG products were almost the same, indicating that SLAAP 395 'SxNxE' might have an association with some inherent attributes of the ISGs but was less likely 396 17 to be involved in the IFN-α-induced PPIs. SLAAP 'KEN' in the IDRs also showed some 397 interesting differences: in the non-ISG products, 41.9% of SLAAP 'KEN' were observed in 398 the IDRs, 14.6% higher than that in the ISG products, which provided an effective approach to 399 distinguish the ISGs from non-ISGs. When SLAAP 'KEN' is discovered in the ordered 400 globular region of a protein sequence, statistically, the protein is more likely to be produced by 401 an ISG, but this assumption is reversed if the SLAAP is located in an IDR (Pearson's chi-402  non-ISGs are taken from dataset S2 while the background human genes are from dataset S1 421 (Table 5)

Differences in network profiles 425
We constructed a network with 332,698 experimentally verified interactions among 17603 426 human proteins (confidence score > 0.63) from the Human Integrated Protein-Protein 427 Interaction rEference (HIPPIE) database [55] to investigate if the connectivity among human 428 proteins has an association with genes' expression in the IFN-α experiments. 10169 out of 429 10836 human proteins produced by genes in our background dataset S1 were included in the 430 network. Nodes and edges of this network can be downloaded from our webserver at 431 http://isgpre.cvr.gla.ac.uk/. Based on this network, we calculated eight features as defined in 432 the methods including the average shortest path, closeness, betweenness, stress, degree, 433 neighbourhood connectivity, clustering coefficient, and topological coefficient. 434 As illustrated in Figure 9B/G, ISG products tend to have higher values of betweenness 435 and stress than background human proteins (Mann-Whitney U test: p = 0.01, and 0.03, 436 respectively), which means they are more likely to locate at key paths connecting different 437 nodes of the PPI network. Some ISG products with high values of betweenness and stress, e.g., 438 tripartite motif containing 25 (TRIM25), can be considered as the shortcut or bottleneck of the 439 network and play important roles in many PPIs including those related to the IFN-α-triggered 440 19 immune activities [56,57]. However, such differential representation of betweenness does not 441 mean ISG products are more likely to be or even be close to bottlenecks of the network 442 compared to the background human proteins. Some examples shown in Table 2 indicate that 443 ISG products are less-connected by top-ranked bottlenecks and hubs of the network than non-444 ISG products or the background human proteins. This conclusion is not influenced by the 445 hub/bottleneck protein's performance in the IFN-α experiments. Comparing proteins produced 446 by the ISGs and non-ISGs, we found the former tends to have lower values of clustering 447 coefficient and neighbourhood connectivity (Mann-Whitney U test: p = 0.04 and 7.9E-03, 448

Figure 9D/F). This discovery indicates that the ISG products and some of their interacting 449
proteins are less likely to be targeted by lots of proteins. It also supports the finding that the 450 ISG products are involved in many shortest paths for nodes but are away from hubs or 451 bottlenecks in the network. To some extent, this location also increases the length of the 452 average shortest paths through ISG products in the network ( Figure 9A). 453 When investigating the association between IFN-α-induced gene stimulation and 454 network attributes of gene products, we only found the feature of neighbourhood connectivity 455 was under-represented as the level of differential expression in the presence of IFN increases 456 (PCC = -0.392, p = 2.2E-04). This suggests that proteins produced by genes that are highly up-457 regulated in response to IFN-α are further away from hubs in the PPI networks.

Features highly associated with the level of IFN stimulations 474
In this study, we encoded a total of 397 discrete and 7046 categorical features covering the 475 aspects of evolutionary conservation, nucleotide composition, transcription, amino acid 476 composition, and network profiles. In order to find out some key factors that may enhance or 477 suppress the stimulation of human genes in the IFN-α system, we compared the representation 478 of discrete features of human genes with different but positive Log2(Fold Change). Two 479 features on the co-occurrence of SLNPs and SLAAPs were not taken into consideration here 480 as they were more subjective than the other discrete features and were greatly influenced by The last one, neighbourhood connectivity, was obtained from the network profile of a human 491 interactome constructed based on experimentally verified data in the HIPPIE database [55]. 492 In the positive group, the feature of 'large' amino acid compositions that includes the 493 composition of five amino acids with geometric volume ranging from 163 to 173 cubic 494 angstroms was ranked first for having the highest PCC at 0.593 (Student t-test: p = 2.8E-09). We grouped human genes into two classes based on their response to the IFN-α in the human 525 fibroblast cells. Genes significantly up-regulated in IFN-α experiments were included in the 526 ISG class, while those that did not were put into the non-ISG class. However, there is also 527 another group of human genes down-regulated in the presence of IFN-α, i.e., the IRGs. They 528 were labelled as the non-ISGs, but contain unique patterns that constitute an important aspect 529 of the IFN response [3]. Some of these IRGs were not up-regulated in any known type I IFN 530 systems, thus they have been placed in a refined non-ISG class for analyses and predictions. 531 Additionally, there are a number of genes that have insufficient levels of expression in the 532 experiments to determine a fold change, i.e., ELGs. Here, we used the previously defined 533 features to compare the ISGs from dataset S2 with the IRGs and ELGs divided from the 534 background dataset S1 (Table 5). 535 As shown in Figure 12, the IRGs are differentially represented to a lower extent in the 536 majority of nucleotide 4-mer compositions than the ISGs, indicating the deficiency of some 537 23 nucleotide sequence patterns in the coding region of IRGs. Note that, many nucleotide 4-mer 538 composition features are more suppressed in the ISGs than non-ISGs although the differences 539 are small. The biased representation of these features in the IRGs suggests that the IRGs have 540 characteristics similar to the ISGs rather than non-ISGs. Additionally, there are a very limited 541 number of features relating to evolutionary conservation, nucleotide sequence compositions or 542 codon usage showing obvious differences between the ISGs and IRGs, but many of them are 543 differentially represented when comparing the ISGs with non-ISGs. Therefore, involving the 544 IRGs in the class of the non-ISGs will increase the risk for machine learning models to produce 545 more false positives. However, there are some informative features differentiating the IRGs 546 from ISGs. For example, compared to the ISGs, the IRGs are more enriched in CpGs (Mann-547 Whitney U test: p = 5.6E-03), which is also mentioned in [59]. The IRGs tend to have higher 548 closeness centrality and neighbourhood connectivity than the ISGs (Mann-Whitney U test: p = 549 0.04 and 6.4E-06 respectively), suggesting that the IRGs are more central in the human PPI 550 network and connected to key proteins with many interaction partners. Differences in some 551 amino acid composition features between the ISGs and IRGs can also be observed in Figure  552 12. Therefore, good predictability is still expected when using features extracted from protein 553 sequences. 554 human genes are from dataset S1 (No. = 10836) ( Table 5). 573 574 Implementation with machine learning framework 575 In this study, we encoded 397 discrete and 7046 categorical features for the analyses. As excess 576 of features will greatly increase the dimension of feature spaces and complicate the 577 classification task for the classifier, we limited the number of SLNPs to the top 100 based on 578 the adjusted p-value and we expected these to be sufficient to provide a picture of short linear 579 sequence patterns in the coding region of the canonical transcript. Accordingly, features 580 measuring the co-occurrence status of multiple SLNPs were recalculated based on the selected 581 100 SLNPs. As a result, we prepared 518 features (Supplementary Data S5) for our machine 582 learning framework. To reduce the impact of noisy data on classifications, we only used the 583 refined ISGs and non-ISGs from dataset S2 for training and modelling. 584 Table 3 firstly shows the comparisons of prediction performance among different 585 machine learning methods. The threshold is determined by maximising the value of MCC. As 586 the random forest (RF) classifier was built based on randomly selected samples and features 587

25
[60], we repeated its modelling procedure ten times. These initial comparisons showed that the 588 support vector machine (SVM) [61] is superior to k-nearest neighbors (KNN) and RF [60]. The 589 poor prediction performance of the best base classifier (SVM, AUC = 0.6509) indicates that 590 there are a number of poorly performing features hidden in the set. As some genes respond to 591 IFNs in a cell-specific manner [2], it is hard to produce predictions unless we detect key 592 discriminating features, which are robust to the change of biological environment. 593 Here, we considered two iterative strategies for selecting predictive features. The first 594 one is the forward feature selection (FFS) [62] wherein features are added sequentially based 595 on their individual performance. This strategy did not work well (Table 3) as the prediction 596 performances were all poor when the feature was used individually (Supplementary Data S5). 597 The second strategy is developed based on the backward feature elimination scheme but uses 598 fewer iterations to achieve the end result, namely AUC-driven subtractive iteration algorithm 599 (ASI) (Figure 15). 600 Pre-processing using the ASI algorithm showed that there were at least 28% of features 601 disrupting the prediction model. The loss of some of the individual nucleotide 4-mer feature 602 seemed not to influence the performance of the classifier at this stage, but the similarities 603 between IRGs and ISGs (Figure 12) particularly in the 4-mer features was a cause for concern 604 when the model was used to predict new data, especially unknown IRGs. 605 When using the ASI algorithm, the number of disrupting features did not stabilise until 606 the algorithm reached the 11-th iteration. The remaining 74 features constituted our optimum 607 feature set for predicting the ISGs (Table 4). Among them, 14 and 9 features displayed positive 608 and negative correlations with the level of up-regulation in IFN-α experiments (p < 0.05). 609 During the procedure, the AUC kept increasing steadily and reached 0.7479 at the end (Table  610 3). The Matthews correlation coefficient (MCC) also showed an overall improvement although 611 it fluctuated slightly during the last few iterations. By ranking the scores calculated by the 612 26 prediction model, we found 68.1% of the 496 genes (equal to the number of ISGs in the training 613 dataset) were successfully predicted as the ISGs. Figure 13B illustrates the distribution of 614 prediction scores generated by the ASI-optimised model for human genes with different 615 expressions in IFN-α experiments. Human genes with higher up-regulation in IFN-α 616 experiments tend to obtain higher prediction scores from our optimised machine learning 617 model (PCC = 0.243, p = 4.2E-10). 618 However, there were also some ISGs incorrectly predicted by our model even though 619 they were highly up-regulated, for example, basic leucine zipper ATF-like transcription factor 620 2 (BATF2, prediction score = 0.34). The model produced 33 ISGs with a prediction score 621 higher than 0.8 but this number for the non-ISGs reduced to six, including one IRG (tripartite 622 motif containing 59 (TRIM59)). The highest prediction score within the non-ISGs was found 623 on ubiquitin conjugating enzyme E2 R2 (UBE2R2, prediction score = 0.88). It contains many 624 features similar to the ISGs but was not differentially expressed in the presence of IFN-α in the 625 human fibroblast cells [3]. The lowest prediction score within ISGs was found on cap 626 methyltransferase 1 (CMTR1, prediction score = 0.12) due to the weak signal from its features. 627 For instance, CMTR1 protein does not contain any ISG-favoured SLAAPs listed in Table 1. 628 The influence of the IRGs on the prediction was reflected in the training dataset but was not 629 significant. Compared with human genes not differentially expressed in the IFN-α experiments 630     (

Review of different testing datasets 656
In this study, we trained and optimised a SVM model from our training dataset S2', and 657 prepared seven testing datasets (dataset S2''/S3/S4/S5/S6/S7/S8) to assess the generalisation 658 capability of our model under different conditions ( Table 5). The S2'' testing dataset was a 659 subset of dataset S2. The prediction performance on this testing dataset was close to that in the 660 training stage with an AUC of 0.7455 ( Figure 14A). The best MCC value (0.345) was achieved 661 when setting the judgement threshold to 0.438, which meant that the prediction model was 662 sensitive to signals related to ISGs. In this case, it performed predictions with high sensitivity 663 but inevitably produced many false positives, especially within IRGs. 664 In the S3 testing dataset, we used 695 ISGs with low confidence. The overall accuracy 665 (equals to sensitivity as there were no negatives) only reached 44.0% when using a judgement 666 threshold of 0.549, about 0.18 lower than SN under the same threshold in the training dataset 667 S2' (Table 3). It is expected as some of their inherent attributes make them slightly up-668 regulated, silent or even repressed (e.g., become non-ISGs in other IFN systems) in response 669 to some IFN-triggered signalling. On this testing dataset, our machine learning model produced 670 38 (5.5%) ISGs with a prediction score higher than 0.8. This number was also lower than that 671 29 on the training dataset S2'. It further indicates the relatively low confidence for the ISGs 672 included in dataset S3. 673 The S4 testing dataset was constructed to illustrate our hypothesis that there are some 674 patterns shared among the ISGs and IRGs at least in the IFN-α system in the human fibroblast 675 cells. On this testing dataset, the prediction accuracy (equals to SP as there were no positives) 676 was 60.2% under the judgement threshold of 0.549, about 0.15 lower than the SP under the 677 same threshold in the training dataset S2' ( Table 3). Leucine rich repeat containing 2 (LRRC2), 678 carbohydrate sulfotransferase 10 (CHST10) and eukaryotic translation elongation factor 1 679 epsilon 1 (EEF1E1) showed strong signals of being ISGs (probability score > 0.9). In total, 680 there were 56 (5.6%) IRGs being incorrectly predicted as ISGs with prediction scores higher 681 than 0.8. This high score was found in an estimated 8.1% of the ISGs but was only observed 682 in 1.2% of human genes not differentially expressed in the IFN-α experiments (Figure 13B). 683 These results indicate that there is a considerable number of IRGs incorrectly predicted as ISGs 684 in the S4 testing dataset due to their close distance to the ISGs in the high-dimensional feature 685 space. This may be the case for many other datasets including dataset S2'', S5, S6, S7, and S8. 686 It also supports our hypothesis about the shared patterns from the machine learning aspect and 687 is consistent with the results shown in Figure 12. 688 The next three testing datasets (S5, S6, and S7) were collected from the Interferome 689 database [24] to test the applicability of the machine learning model across different IFN types. 690 The ISGs in these testing datasets were all highly up-regulated (Log2(Fold Change) > 1.0) in 691 the corresponding IFN systems while all the non-ISGs were not up-regulated after 692 corresponding IFN treatments (Log2(Fold Change) < 0). The results shown in Figure 14 reveal 693 that the ISGs triggered by type I or III IFN signalling can still be predicted by our machine 694 learning model, but the performance is limited to some extent (AUC = 0.6677 and 0.6754 695 30 respectively). However, it is almost impossible to make normal predictions with the current 696 feature space for human genes up-regulated by type II IFNs (AUC = 0.5532). This was approximately 0.21 lower than the SN under the same threshold in the training dataset 711 S2' (Table 3). It suggests that there are more non-ISGs than ISGs in this dataset, which is 712 consistent with the results shown in Figure 12. Particularly, we found ten ELGs with prediction 713 scores higher than 0.9: CD48 molecule, CD53 molecule, lipocalin 2 (LCN2), uncoupling 714 protein 1 (UCP1), coiled-coil domain containing 68 (CCDC68), potassium calcium-activated 715 channel subfamily M regulatory beta subunit 2 (KCNMB2), potassium voltage-gated channel 716 interacting protein 4 (KCNIP4), zinc finger HIT-type containing 3 (ZNHIT3), serpin family B 717 member 4 (SERPINB4), and fibrinogen silencer binding protein (FSBP). By retrieving data 718 from the Genotype-Tissue Expression project [65], we found that the expression of these ELGs 719 was generally limited with the exception of CD53 and ZNHIT3 (Figure 15). The expression 720 31 data of CD53 were not included in the OCISG database [3] and were also limited in the 721 Interferome database [24]. It only showed slight up-regulation after type I IFN treatments in 722 blood, liver, and brain but there is currently no record of its expression level in the presence of 723 IFN-α in the human fibroblast cells. ZNHIT3 is another well-expressed gene lacking 724 information in the OCISG. In the Interferome database [24], we found that ZNHIT3 could be 725 up-regulated after IFN treatments in some fibroblast cells on the skin. As for the remaining 726 eight ELGs, despite their limited expression in the human fibroblast cells, their features suggest 727 that they are very likely to be IFN-α stimulated in a currently untested cell type.

738
In this study, we investigated the characteristics that influence the expression of human genes 739 in IFN-α experiments. We compared the ISGs and non-ISGs through multiple procedures to 740 guarantee strong signals for the ISGs and to avoid cell-specific influences that resulted in the 741 lack of ISG expression in certain cell types [2]. Even some highly up-regulated ISGs can 742 become down-regulated when the biological conditions change, exemplified by the 743 performance of C-X-C motif chemokine ligand 10 (CXCL10) on liver biopsies after IFN-α 744 treatment. This refinement is necessary as the representation of features between the ISGs and 745 32 background human genes shows that many non-ISGs especially IRGs have similar feature 746 patterns to the ISGs (Figure 12). 747 Generally, the ISGs are less evolutionarily conserved and include more human 748 paralogues than the non-ISGs. They have specific nucleotide patterns exemplified by the 749 depletion of GC-content and have a unique codon usage preference in coding proteins. There 750 are a number of SLNPs widely observed in the cDNA of the ISGs which are relatively rare in 751 the non-ISGs (Supplementary Data S4). Likewise, there are also many SLAAPs highlighted 752 in the sequences of ISG products that are absent or rare in the non-ISG products ( Table 1). In 753 the human PPI network, the ISG products tend to have higher betweenness than the background 754 human protein. Abnormal expression or knockout of these proteins will increase the diameter 755 of the network and may lead to some lethal consequences that are not tolerated in signalling 756 pathways [67-69]. These ISG-specific patterns may be the result of the evolution of the innate 757 immune system in vertebrates and could be adaptations to the cellular environment induced by 758 interferon following a pathogenic infection [70]. It is also possible that some of the particular 759 SLNPs and SLAAPs may be functionally important as the cell changes from non-infected to 760 infected. Experimental evidence will be necessary to investigate this. 761 Some inherent properties of the ISGs facilitate or elevate their expression after IFN-α 762 treatments but may also be used by viruses to escape from IFN-α-mediated antiviral response 763 [22]. For instance, we found that a higher dN/dS ratio was positively correlated with gene up-764 regulation following IFN-α treatments (Figure 10). This suggests ISGs are on average under 765 stronger adaptive evolutionary selection pressure than the non-ISGs possibly linked to their 766 evolution as anti-viral molecules. , It will also facilitate the virus to interfere with IFN-α 767 signalling through the JAK-STAT pathway and inactivate downstream cellular factors 768 involved in IFN-α signal transductions [22]. We found arginine was under-represented in the 769 ISG products compared to the non-ISG products. As arginine is essential for the normal 770 33 proliferation and maturation of human T cells [72], such depletion in the ISG products may 771 leave a risk of inhibiting T-cell function and potentially increase susceptibility to infections 772 [73]. Furthermore, the special pattern of the ISGs also promotes the representation of some 773 features even if they are not well represented in nature, for example, the higher cysteine 774 composition in the ISGs. We hypothesize that it may be helpful to activate T-cell to regulate 775 protein synthesis, proliferation and secretion of immunoregulatory cytokines [74,75]. There 776 are also some features (e.g., methionine composition) not differentially represented between 777 the ISGs and non-ISGs but play important roles in IFN-α-mediated immune responses. For 778 example, there is evidence for the methionine content playing a role in the biosynthesis of S-779 Adenosylmethionine (SAM), which can improve interferon signalling in cell culture [76,77]. 780 As previously mentioned, there were similar patterns between the feature representation 781 of the ISGs and IRGs, which led to the unclear boundary for the ISGs and non-ISGs in the 782 feature space. We found significant differences in the representation of features on 783 evolutionary conservation (Figure 4) between the ISGs and non-ISGs, but they became non-784 significant when comparing the ISGs with IRGs. Similar phenomena were observed on many 785 features deciphered from the canonical transcript, e.g., dinucleotide composition and codon 786 usage features. We hypothesis that IRGs are former ISGs that have evolved to be down 787 regulated to avoid any unintended harmful consequences. Furthermore, despite so many 788 similarities between the ISGs and IRGs, the separate classification of these genes is still 789 possible. 4-mer compositions can be considered as the key features as most of them are 790 differentially represented between ISGs and IRGs (Figure 12). Using proteomic features can 791 also help to differentiate the ISGs from IRGs but is not as predictive as using 4-mer features. 792 In the machine learning framework, we developed the ASI algorithm to remove poorly 793 performing features but kept features that do not influence prediction performance when 794 removed individually from iterations. Features may have synergistic effects on the prediction 795 34 performance. The elimination of some specific features may ruin such improvement even when 796 they were individually uninformative for the improvement of the classifier. In this case, 797 keeping as many useful features as possible seems to be a reasonable option but will greatly 798 increase the dimension of the feature space and increase the risk of overfitting [78]. By contrast, 799 our ASI algorithm avoided such a risk and kept the synergistic effect of different features 800 through iterations. 801 In the prediction task, we found some previously labelled non-ISGs with very high 802 prediction scores, suggesting that they had someinherent properties consistent with them being 803 stimulated after IFN-α treatments. Some, for example, UBE2R2 has been shown to be 804 significantly up-regulated after IFN-α treatment [79]. The non-ISG label had been assigned 805 because the relevant expression data in the presence of IFN-α were not included in the OCISG 806 [3] and Interferome databases [24]. We also found ten ELGs with very high prediction scores 807 (> 0.9). Literature searches on these genes indicate that they are likely to be involved in the 808 innate immune response [80,81]. Their responses may be limited to certain tissues or cell types for other ELGs is harder to assess, particularly those for which expression is absent in a range 816 of tissues (e.g., UCP1 in Figure 15). UCP1 is a mitochondrial carrier protein expressed in 817 brown adipose tissue (BAT) responsible for non-shivering thermogenesis [85]. It is possible 818 that UCP1 is stimulated directly or indirectly by IFN-α in BAT, resulting in the defended 819 elevation of body temperature in response to infection. 820 35 We developed the machine learning model based on experimental data from the human 821 fibroblast cells stimulated by IFN-α. It can be generalised to type I or III IFN systems, 822 presumably because activations of type I and III ISGs are both controlled by ISRE [9] and aim 823 to regulate host immune response [10][11][12]. However, our model cannot be used for predictions 824 in the type II IFN system (AUC = 0.5532, best MCC = 0.083, Figure 14). It may be caused by 825 the different control elements and the different roles in human immune activities [14]. One 826 feasible strategy is to reclassify the ISGs/non-ISGs based on the IFN experiments in the type 827 II IFN system. Using only the overlapping ISGs and non-ISGs in both type I and type II IFN 828 system for modelling could be another solution. In summary, our analyses highlight some key 829 sequence-based features that are helpful to distinguish the ISGs from non-ISGs, or IRGs. While The exclusion of these non-ISGs could effectively reduce the risk of involving false negatives 862 in analyses and producing false positives in predictions. As a result, the refined dataset S2 863 contains 620 ISGs and 874 non-ISGs with relatively high confidence. 864 The training procedure in the machine learning framework was conducted on the 865 balanced dataset S2'. It consisted of 992 randomly selected ISGs and non-ISGs from dataset 866 S2. The remaining human genes in S2 were used for independent testing. Additionally, we also 867 constructed another six testing datasets for the purpose of review and assessment. Dataset S3 868 contained 695 ISGs with low confidence compared to those ISGs in dataset S2. Some of them 869 could be non-ISGs or even IRGs in the type I IFN system. Dataset S4 contained 1006 IRGs 870 37 from the human fibroblast cell experiments. Dataset S5, S6, and S7 were constructed based on 871 records for experiments in type I, II, and III IFN systems from Interferome 872 (RRID:SCR_007743) [24]. The criterion for an ISG in the latter three datasets was a high level 873 of up-regulation (Log2(Fold Change) > 1.0) while that for non-ISGs was no up-regulation after 874 IFN treatments (Log2(Fold Change) < 0). The last testing dataset S8 was derived from our 875 background dataset S1, containing 2217 ELGs. A breakdown of the aforementioned eight 876 datasets is shown in Table 5. Detailed information of the human genes used in this study is 877 provided in Supplementary Data S1. The cDNA and protein sequences are accessible at 878 http://isgpre.cvr.gla.ac.uk/. 879 880   The difference between ISGs and non-ISGs The difference between ISGs and human genes The feature of nucleotide compositions (7) The feature of dinucleotide compositions (16) The feature of codon usages (64) The feature of nucleotide 4-mer compositions (256)

Figure 5
Click here to access/download; Figure;Figure_5 Lower bound of co-occurrence of SLNPs The difference between ISG and non-ISG products The difference between ISG products and human proteins   Click here to access/download; Figure;Figure_9  The difference between ISGs and ELGs The feature of nucleotide compositions (7) The feature of dinucleotide compositions (16) The feature of codon usages (64) The feature of nucleotide 4-mer compositions (256) The feature about evolution (7) The feature of amino acid composition (37) The feature about interactome network (8)

≤ ≤
The difference between ISGs and non-ISGs Log10(TPM+1) On behalf of my co-authors please consider the resubmission of our research article entitled 'Defining the characteristics of interferon-alpha-stimulated human genes: insight from expression data and machine learning' for consideration in your journal. We believe these revisions will address all the concerns raised by you and the reviewers. We apologies for the misunderstanding with respect to the ROC curve and AUC values.
We have added these in the legend for each dataset. We have also added ISGPRE to biotools and registered ISGPRE with scicrunch.com and received the RRID:SCR_022730, however it does not yet appear when browsing the scicrunch dashboard. We have also made a few additional clarifications to the text, in particular, after much discussion, we decided to change the 'noisy' feature terminology to poorly performing feature (this was originally called disruptive feature).
We confirm that all authors have read and approved this version of the manuscript.

Joseph Hughes
Personal Cover Click here to access/download;Personal Cover;Cover_letter-2.docx